Europhysics Letters 



PREPRINT 



Phase transition in the assignment problem for random 
matrices 

J. G. Esteve 1 ' 2 and F. Falceto 1 ' 2 

1 Departamento de Fisica Teorica, Universidad de Zaragoza. Zaragoza, Spain. 

2 Instituto de Biocomputacion y Fisica de Sistemas Complejos, Universidad de Zaragoza. 
, Zaragoza, Spain. 

> 
O 

o 

CO 



U 
U 



X 



PACS. 02.60.Pn - Numerical optimization. 
PACS. 02.70.Rr -General statistical methods . 
PACS . 64 . 60 . Cn - Order-disorder transformations. 



Abstract. - We report an analytic and numerical study of a phase transition in a P prob- 
lem (the assignment problem) that separates two phases whose representatives are the simple 
matching problem (an easy P problem) and the traveling salesman problem (a NP-complete 
problem) . Like other phase transitions found in combinatoric problems (K-satisfiability, number 
partitioning) this can help to understand the nature of the difficulties in solving NP problems 
an to find more accurate algorithms for them. 



> 

In the theory of computational complexity, two paradigmatic problems representative of 
the classes A^P-complete and P are the traveling salesman and the assignment problem. 
Traveling salesman problem (TSP) requires N points or cities, with distances dij between 
them, for which we must find the closed tour of minimum length that visits each city once. In 
the assignment or bipartite matching problem (AP), we have N objects of two classes (say A 
c/i ■ and B) with distances dij between each object i of the class A and the object j of the class 
B. We must assign at each object i of the class A one and only one object a{i) of the class B 
in such a way that the total distance J3i=i ^i,crU) 1S a minimum, so we can write D(AP) as: 



D(AP) = Min (ffeSjv) 



' N 



(i) 



being Sn the symmetric group. This problem can be seen in another way when the sets A 
and B are the same (for example cities), in this case D(AP) is the path of minimum length 
that visits each city once and is composed of so many closed sub tours as needed. In other 
words, when A = B the bipartite matching can be seen as a problem of n-traveling salesmen 
where each salesman must start and end his tour in the same city and we can use as many 
salesmen as needed in order to minimize the total length of all salesmen tours. Although 
the configuration space for the AP problem has AH elements while that of the TSP has only 
(N — 1)!, the AP is in the class P and can be solved in times that grows like A^ 3 with the 
number N of cities. 
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There is another interesting limit for the AP, when we introduce the constrain that the 
number of tours must be N/2 (for N even) or equivalently, that the N cities must be pair- 
wise matched. This limit is called simple matching problem (SMP) and in this case the 
configuration space has (N — 1)!! elements and the problem remains in the class P. 

Depending on the characteristics of the distance matrix the AP could, in principle, inter- 
polate between those situations which are near of the SM problem (that is, situations where 
the optimal solution is composed approximately of N/2 cycles) and those which are near of 
the TSP problem (when the optimal solution is composed only of a few cycles). We shall see 
that the control parameter that governs the transition between both limits is the correlation 
between the distances djj and dj $ [1], in such a way that for positive correlations the optimal 
solution for the AP is in the "SMP regime" , whereas for anti correlated distances the solution 
is in the "TSP regime" [2]. 

The aim of this work is to study the structure (with reference to the number of cycles) of the 
optimal solution for the bipartite matching with random distance matrices e.g. matrices whose 
elements dij are random numbers with a distribution of probability p(dij). The problem of 
bipartite matching on random matrices has been studied for many years, but those studies 
focused on the length of the minimal path D (AP). For example for random matrices whose 
probability distribution is p{dij) = cxp (— rfjj) it was conjectured first by G. Parisi [3] and 
then proved rigorously [4]- [6] that the average length is D(AP) = X, with N the 

number of points that must be matched. For more general distributions the first terms of 
the expansion in \/N of D(AP) are known ( [7]- [11]). As for the TSP on random matrices 
with p(0) = 1, the averaged length of the minimal tour is D(TSP) = 2.041... + 0(1/N) and 
the 1 /N corrections can be computed in principle although "the computations become rather 
long" [12]. 

We shall work with random distance matrices given by 

did = n -j + XR ji ' + 3 ( 2 ) 
d%,i = oo, 

where Rij, i ^ j are independent random variables with uniform distribution in the interval 
[0, 1] and A G [—1,1]. d^ = oo is imposed in order to make every salesman to travel out of 
his home city i. e. we do not allow 1-cycles [13]. A = corresponds to a random matrix 
with all entries uncorrelated, A = 1 is the symmetric case dij = dj t % = + Rji and A = — 1 
the antisymmetric one dij — Rij — Rji — —djj. For each value of A we generate M different 
distance matrices (typically M varies between 10 4 and 10 6 ) and we solve for each of them 
the assignment problem using the algorithm of R. Jonker and A. Volgenant [14]. Then we 
measure the mean value of the number of cycles (n c ), which is plotted in Fig. ^ 

There we can see two very different regimes for the behavior of (n c ) as a function of A. 
When the correlations between dij and dj^ are negative (—1 < A < 0), (n c ) is (almost) 
constant with A, has a logarithmic dependence with N i. e. the solution for the assignment 
problem is composed typically of a few cycles, close to the TSP problem. This situation 
contrasts with that of positively correlated d% $ and dj i (0 < A < 1), where (n c ) varies with a 
nearly linear dependence in A and AT, reaching its maximum for A = 1 with an approximate 
value (n c ) x=1 ~ N/2. At this point the solutions are very close to those of the SM problem 
in the sense that they are dominated by 2-cycles. 

The cases A = and —1 can be analyzed explicitly. For A = the distance matrix is 
completely random in the sense that off diagonal entries are equally distributed, independent 
random variables. Hence all the permutations have the same probability of being the optimal 
tour (except those which have some 1-cycle that are excluded) [15] and studying the structure 
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Figure 1 - Mean value of the number of cycles of the optimal solution for the assignment problem 
against A for different values of the number of cities N. The inset is the theoretical (continuous line) 
and experimental (dots) mean value of the number of cycles divided by log(iV) as a function of N, 
for A = — 1. Error bars represent three standard errors of the mean. 



of cycles of the optimal solution is the same that to study the structure of cycles of the 
corresponding subset of the permutation group. The latter can be done with the help of the 
associated Stirling numbers of the first kind d% (N, k) defined as the number of permutations 
of N elements having k cycles, all of which of length r > 2 [16]- [18]. Using the generating 
function of ds(N, k) we can calculate exactly the expected value of n c as: 



(»c) A =o = - 



' d N / r (log(l-x)+x)exp(-a:)\ 




dx N \ l—x J 


. x=0 


[ d N ( exp(-x)\ 
[dx N \ l—x ) 






x=0 





= H N -l + Y,C\N-\ (3) 

8=1 

where Hn = X)m=i(V m ) is the harmonic series and the first coefficients in the expansion in 
© are Q = 1, -1/2, -1/6, 1/4, 8/15, 1/12, -85/42, -125/24, 13/90,479/10, 5800/33 for 
i = 1, . . . , 11. In order to check the code of our simulations, in Fig. [21 we plot the expected 
values of the mean number of cycles at A — ({n c ) A=0 ) obtained in our simulations (dots) and 
compared with the theoretical results of 10 (continuous line). 

Note that this result is based, only, in the fact that all distances are identically distributed 
independent random variables, and is independent of the actual probability distribution. That 
is, a change of the probability will affect, only, to the value of the length of the tour and not 
to the value of (n c ). Should we use random numbers also for ck t i (recall (J2J) then 1-cycles 
would be allowed and the average number of cycles (n c )°_ is exactly the Harmonic Series 
H n . So for N — > oo we have that (n c ) x=g = (n c )° x _ — 1, i- e. for large values of N, to allow 
or to forbid one-cycles only changes in one unit the mean value of n c . 

The case A = — 1 can be studied in a similar way, although here we do not have an exact 
result but an extremely good approximation. The key observation is that the minimal tour 
tends to be made of the smallest distances available, then for anti correlated dij and djj it 
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Figure 2 - Theoretical (continuous line) and experimental (dots) mean value of the number of cycles 
as a function of N, for A = 0. Again, error bars correspond to three standard errors of the mean. 



is very unlikely that both enter in the minimal tour (because one of them will be positive 
and the other negative adding up to zero). The net effect is that the anti correlations tend 
to suppress the appearance of 2-cycles in the optimal tour and this effect will increase with 
increasing N. This has been verified in the numerical simulations, whose results are plotted 
in Fig. 3. There we represent the logarithm of the average number of 2-cycles in the optimal 
tour (P 2 ) as a function of N for A = —1, —0.2 and —0.1. 
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Figure 3 - Numerical values of the logarithm of the probability that a 2-cycle appears in the optimal 
tour as a function of N for A = —1, —0.2 and —0.1. 

As can be seen in Fig. the probability of having a two-cycle in the optimal solution 
decays exponentially with N with a coefficient that depends on A. For A = — 1 the points 
can be fitted to log[P 2 ] = 2.17653 - 0.941985N, i. e. P 2 ~ M~ N with £ = 2.565..., so when 
N > 20 we can neglect the 2-cycles and then all the distances that appears in the optimal 
tour will be uncorrelated. The problem is again reduced to the study of the subset of the 
permutation group, in this case without 1-cycles and 2-cycles. This can be done with the help 
of the associated Stirling numbers ds(N, k) and their generating function. Finally for A = — 1, 
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the value of (n c ) can be computed (up to corrections of order £ N ) as: 



(n c ) 



' d N ( (log(l-«)+p(ar))exp(- 






dx N \ 1—x 


- P (x))y 


£C=0 


r d N ( exp(- P (x))y 

dic-^ \ 1—X J 


Z=0 



i=l 

where p(x) = x + x 2 /2, and the first coefficients in (gj are C t = 2, -3/2, -5/6, 7/4, 106/15, 
67/12, -2627/42, -8633/24, 47929/90,31758/5, 1989059/33 for % = 1, 11. In the inset of 
Fig. ^ we plot, together, the values of (n c ) / \og(N) obtained form g} and the experimental 
points, so we can see that the agreement is excellent. It should be noted that in the limit 
N — > oo we obtain from @ and 10} the relation 

lim (K) A=0 - (tO^-i) = \- (5) 

N— >oo Z 

which states that, for large N, between A = and A = — 1 the value of (n c ) decreases only 0.5 
units. Actually we can show that 

lim {H N - (n c ) J = | VA6[-1,0) (6) 

which tells us that in the thermodynamic limit N — > oo the expected value of the number 
of cycles is independent of A in the range A < 0. The theoretical explanation for this fact 
is again based in the exponential suppression of 2-cycles shown in Fig. 3, and therefore the 
equal probability of all permutations (without 1 or 2-cycles) for producing the optimal tour. 
This, in turn, implies that up to corrections of order 0(^ N ) the results are independent of 
the probability distribution as it happened in the random (A = 0) case. In this sense the 
concrete distribution used for the entries of our random matrix (J2J) that except for A = is 
non uniform, has little effect on the results and the influence weakens more and more as the 
dimension N grows. 
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Figure 4 - Plot of the ratio between the number of 2-cycles (P2) and the total number of cycles versus 
A for different values of the dimension. 

The right half of figure 1, corresponding to positive values of A, is much more poorly 
understood. It is clear from the diagram that (n c ) behaves linearly with A with a slope close 
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to N/2. This linear behavior has corrections near A = 1 that are visible, for instance, in the 
change of the slope in the curve for N = 200 at A = 0.8. It is apparent that the cycles of the 
optimal solutions in this side are dominated by 2-cycles, and as one can see from Fig. 4 

lim —4- = 1 VA > 0. 

jv->oo {n c ) x 

Actually the number of 2-cycles grows linearly with TV for all positive values of A while the 
rest, i. e. the total number of /c-cycles with k ^ 2 grows logarithmically. As A approaches 
from the positive side the linear behavior with TV in has a smaller slope and the average 
number of 2-cycles goes to a constant 1/2 (for large N) in the A = limit. 

In relation to this we can prove that in the symmetric point (A = 1) the probability of 
having a 2/c-cycle with k > 1 vanishes. For if the links in even positions of the cycle add up 
to a length smaller (greater) that those in odd positions then breaking the 2fc-cycle into k 
2-cycles, using the even (odd) links, lowers the length of the tour. As a result instead of a 
2/c-cycle we would have k 2-cycles in the optimal tour. This feature helps to understand the 
abundance of 2-cycles for positive values of A and the change of the slope near A = 1. 




Figure 5 - Plot for the probability (Pjv) of having an optimal solution consisting of one iV-cycle times 
the dimension N versus A. 

The dimerized phase corresponding to A > contrasts with the polymerized one for A < 0. 
In this region there are A^-cycles in the optimal solution with a probability Pjv obtained from 
the corresponding generating function 

Pn(X<0) = ^ + O(C N ) (7) 

while A^-cycles are exponentially suppressed in the dimerized region. The intermediate case 
corresponds to A = with 

P N (X = 0) = ^+O(l/N\). (8) 

The error in (JJJ is produced by the residual presence of 2-cycles as discussed after Fig. 3, 
while that of ©, much smaller, comes from the tail of the series for the exponential. This 
has been verified numerically, the results are shown in Fig. 5. A more detailed analysis of the 
scaling as well as the "critical" properties of the transition is in progress. 
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Notice that in the polymerized phase (A < 0) there are a non-negligible probability of 
finding a solution of the TSP in polynomial time, actually after N runnings of the problem 
one would solve in a time of the order N ■ N 3 a number of TSP's approximately equal to 

e 3/2 = 44 g_ 

In conclusion, we have found a phase transition in the random assignment problem that 
separates two regimes where, while remaining always solvable in polynomial time, the system 
approaches the traveling salesman problem and the simple matching problem respectively. 
This phase transition can be seen as complementary of that which appears in many NP- 
complete problems separating the easy instances from the hard ones ( [2], [19]). In our case, 
the control parameter is the correlation between the distances dij and djj which in turn 
controls the existence, or not, of allowed 2-cycles in the optimal tour. When both distances 
are positively correlated, the 2-cycles are allowed and favored and in the limit N — > oo, the 
(n c ) is dominated by the 2-cycles as seen in Fig. consequently, the AP optimal solution 
for those instances is very far away of that of the TSP solution for the same matrix. On the 
contrary, when dij and dj t % are negatively correlated it is very unlikely that a 2-cycle enters 
in the optimal solution, consequently they are strongly suppressed (see Fig. -OH and and the 
optimal tour is composed of a "few" cycles ( « log(JV)) which means that the AP optimal tour 
for those distances is near the one of the TSP problem for the same matrix. This property 
can be useful as a starting point for designing improved algorithms for solving the traveling 
salesman problem. 

* * * 
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